In this tutorial, we will be fitting Bayesian distance sampling-adjusted log-Gaussian Cox process models to the survey data using the inlabru package. This will provide us with the pre-requisite knowledge needed to fit integrative joint log-Gaussian Cox process models with both survey and opportunistic sightings data reported by whale watch companies!
Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.
library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)
Using getwd() check that you are inside the folder ‘DFO_SDM_Workshop_2020’ that we downloaded in the previous tutorial. If not, change this by navigating to the correct folder in the bottom right panel (folder view), opening it, and then clicking “Set as Working Directory” under the tab ‘More’.
Now we can load in the precompiled data and functions
list2env(readRDS('./Data/Modelling_Data.rds'),globalenv())
## <environment: R_GlobalEnv>
source('utility_functions.R')
Finally, let’s turn off all warnings associated with coordinate reference systems.
rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")
JOE: I’m assuming some of these things will be discussed in the lectures, but some of the concepts that you are talking about here will be really difficult to grasp to an average ecologist. I think it’s worth defining the important terms again here, e.g., Guassian random fields. You should NOT defined the terms that are not essential e.g. SPDE. FOr those just stating they are a trick, as you did, is sufficient.
Log-Gaussian Cox process models require the specification of Gaussian random fields. Gaussian random fields can become computationally prohibitive. We will use a smart computational trick, which approximates a stochastic partial differential equation (SPDE) with a Gaussian Markov random field (GMRF) evaluated across a delauney triangulation mesh. This approach was developed by Lindgren and Rue (2011) (Lindgren being an author of the inlabru package).
The first step to using this approach is creating the mesh. Good rules of thumb are as follows:
define maximum triangle lengths no shorter than half of the expected spatial range of your problem. Spatial range here is loosely defined as the ‘distance required between two locations for the whale intensity at these two locations to be approximately independent of each other’. Note that it is impossible to detect clusters or hotspots with sizes smaller than the maximum length of triangles used.
Do not define the minimum triangle lengths too small. This could lead to very large meshes! This is specified with the argument cutoff.
Use a minimum triangle angle of 21 degrees.
We define a first mesh using the function inla.mesh.2d from the INLA package. We choose a minimum triangle edge length of 18km, a maximum triangle edge length of 45km, a minimum angles of 25 degrees and we specify that we want a hard boundary equal to the coastline (as defined by the SpatialPolygonDataFrame DOMAIN). Note that we can plot meshes using gg as before1.
mesh_land <- inla.mesh.2d(boundary = Domain,
min.angle = 25,
max.edge = 45,
cutoff = 18)
ggplot() + gg(mesh_land) + gg.spatiallines_mod(Effort_survey,colour='red')
Note that spatial correlations < 18km will not be detectable! We can print the number of mesh vertices (directly related to the length of time required later to fit the model).
mesh_land$n # lots of triangle vertices can lead to slow computation time. 551 good
## [1] 551
We have a potential serious issue with this mesh. The coastline is jagged, which could lead to numerical instabilities to occur later. One easy solution is to buffer and/or smooth the coastline shapefile, making sure that all survey lines and sightings fall within the new modified shapefile. We do this below, by first looking at the right buffer size.
# Buffered by 5km
ggplot() + gg(gBuffer(Domain, width=5)) + gg.spatiallines_mod(Effort_survey,colour='red')
gContains(gBuffer(Domain, width=5), Effort_survey)
## [1] FALSE
# Doesn't contain all the survey lines
# Buffered by 10km
ggplot() + gg(gBuffer(Domain,width=10)) + gg.spatiallines_mod(Effort_survey,colour='red')
gContains(gBuffer(Domain,width=10), Effort_survey)
## [1] FALSE
# Again, doesn't contain all the survey lines
# Buffered by 15km
ggplot() + gg(gBuffer(Domain,width=15)) + gg.spatiallines_mod(Effort_survey,colour='red')
gContains(gBuffer(Domain,width=15), Effort_survey)
## [1] TRUE
# Success, does contain all the survey lines!
Looking at the buffered coastline, we see that the boundary is quite ‘wiggly’. This may prove problematic when trying to fit a series of triangles around it. To ‘straighten out’ the boundary, we can ‘simplify’ it using gSimplify. The amount of ‘straightening’ is specified by the tolerance argument tol.
# Use gSimplify to smooth the edges
ggplot() + gg(gSimplify(gBuffer(Domain,width=15),tol=5)) + gg.spatiallines_mod(Effort_survey,colour='red')
gContains(gSimplify(gBuffer(Domain,width=15),tol=5), Effort_survey)
## [1] TRUE
Domain_smoothed <- gSimplify(gBuffer(Domain,width=15),tol=5) # again all the survey tracklines are contained in the modified boundary
Using this new buffered and smoothed domain/study area shapefile (i.e. Domain_smoothed), we create below a new mesh named mesh_land.
mesh_land <- inla.mesh.2d(boundary = Domain_smoothed,
min.angle = 25,
max.edge = 45,
cutoff = 18)
mesh_land$crs <- Can_proj
mesh_land$n
## [1] 602
ggplot() +
gg(mesh_land) +
gg.spatiallines_mod(Effort_survey,colour='red')
One may be tempted to define the mesh within a much larger convex hull around the coastline, but see Additional tips for explanation of why this is an inadequate approach.
Next, we need to redefine our covariates. Our covariates depth and slope were defined on the original (un-buffered) domain as defined by the SpatialPolygons Domain. Since we have buffered (i.e. extended) our coastline, we need to sensibly define covariate values at the extension locations. We choose to ‘fill-in’ these points with nearest neighbour imputations. These values will not significantly affect the inference, as the effective area covered by these ‘extension regions’ are very small once we downweight by the effort surface \(\lambda_{eff}\).
JOE: there is too much on this chunk of code. You’ll see on Github that I’ve made a suggestion: just create the transformed variables in the previous tutorial (which we kind of do, 1_Exploratory_Analysis) and explain that this is what we will use throughout, and just add them to the data we here and focus on the domain part. Make sure they have the same name in all tutorial. I think right now they don’t.
# Create the log depth covariate as in Session 1
log_Depth <- Bathym
log_Depth$FIWH_MAR_Bathymetry[log_Depth$FIWH_MAR_Bathymetry >= 0] <- 0
log_Depth$FIWH_MAR_Bathymetry <- log(1-log_Depth$FIWH_MAR_Bathymetry)
names(log_Depth) <- 'log_Depth'
## Now to redefine the covariate across our modified domain
# 1) Define a set of pixels across our modified domain
pixels_Domain <- pixels( mesh_land,mask=gSimplify(gBuffer(Domain,width=19),tol=3),
nx=400,ny=400)
# 2) Extract values of the covariate at the new pixel locations
pixels_Domain$log_Depth <- over(pixels_Domain,log_Depth)$log_Depth
# 3) impute missing values with the nearest neighbour value
pixels_Domain$log_Depth[is.na(pixels_Domain$log_Depth)] <-
log_Depth$log_Depth[nncross(as(SpatialPoints(pixels_Domain@coords[which(is.na(pixels_Domain$log_Depth)),]),'ppp'),
as(SpatialPoints(log_Depth@coords),'ppp'),
what = 'which')]
log_Depth <- pixels_Domain
# 4) Create a squared log depth covariate (for later)
log_Depth_sq <- log_Depth
names(log_Depth_sq) <- 'log_Depth_sq'
log_Depth_sq$log_Depth_sq <- log_Depth_sq$log_Depth_sq^2
# Repeat the above for slope
log_Slope <- Slope
log_Slope$FIWH_MAR_Slope <- log(1+log_Slope$FIWH_MAR_Slope)
names(log_Slope) <- 'log_Slope'
pixels_Domain <- pixels( mesh_land,mask=gSimplify(gBuffer(Domain,width=19),tol=3),
nx=400,ny=400)
pixels_Domain$log_Slope <- over(pixels_Domain,log_Slope)$log_Slope
pixels_Domain$log_Slope[is.na(pixels_Domain$log_Slope)] <-
log_Slope$log_Slope[nncross(as(SpatialPoints(pixels_Domain@coords[which(is.na(pixels_Domain$log_Slope)),]),'ppp'),
as(SpatialPoints(log_Slope@coords),'ppp'),
what = 'which')]
log_Slope <- pixels_Domain
log_Slope_sq <- log_Slope
names(log_Slope_sq) <- 'log_Slope_sq'
log_Slope_sq$log_Slope_sq <- log_Slope_sq$log_Slope_sq^2
# Plot the covariates with mesh overlayed
ggplot() + gg(log_Depth) + gg(mesh_land)
ggplot() + gg(log_Slope) + gg(mesh_land)
Next, we need to define our Gaussian random field model and its priors. To do this, we use the function inla.spde2.pcmatern from the INLA package. Note that the arguments prior.sigma and prior.range define arguments for the penalized complexity (PC) prior of the random field developed by Simpson et al. (2017) and Fuglstad et al. (2019).
We are required to define PC priors, on the spatial range \(\rho\) (the wiggliness) and marginal standard deviation \(\sigma\) (the magnitude) of the field. PC priors take the form of prior probabilistic statements on quantiles, providing a user-friendly way of defining sensible, interpretable priors.
Before defining our priors, let’s remind ourselves how the Gaussian random field is affected by the range and marginal standard deviation. Below is a chunk of code defining a function (GRF_sim()) that that simulates and plots a single Gaussian random field on the unit square. You are asked to specify the range and marginal standard deviation:
# Specify spatial range and variance
sim_range <- 0.1
sim_sd <- 2
# Function for simulating and plotting a GRF on a unit square
GRF_sim <- function()
{
# define inla mesh
mesh <- inla.mesh.2d(loc.domain = cbind(c(0,1,1,0),c(0,0,1,1)),
min.angle = 21, max.edge = 0.04, cutoff = 0.02)
plot(mesh)
mesh$n
# define spde
spde_obj <- inla.spde2.pcmatern(mesh, constr = T,
prior.range = c(sim_range, 0.5),
prior.sigma = c(sim_sd, 0.5))
# define indices
index <- inla.spde.make.index('smooth',n.spde = spde_obj$n.spde)
# define prediction matrix
A_proj_pred <- inla.spde.make.A(mesh, loc = mesh$loc[,c(1,2)])
# simulate GRF
Q = inla.spde2.precision(spde_obj, theta = c(log(sim_range), log(sim_sd) ))
field <- inla.qsample(Q=Q)
field <- field - mean(field)
plot_pixels <- pixels(mesh, mask=as(owin(),'SpatialPolygons'), nx=150, ny=150)
# predict the values onto a high res grid
A_proj_grid <- inla.spde.make.A(mesh, loc=plot_pixels@coords)
plot_pixels$field <- as.numeric(A_proj_grid %*% field)
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
limits = range(..., na.rm=TRUE))
}
ggplot() + gg(plot_pixels) + colsc(plot_pixels$field) +
ggtitle(paste0('SD = ',sd,' Range = ',range))
}
To simulate a field with the specified SD and range run
GRF_sim()
With the understanding of what the SD and range define, we are asked to define the following two probabilistic statements (remember we are working in units of km):
\[P(\rho < \rho_0) = p_{\rho} \\ P(\sigma > \sigma_0) = p_{\sigma}.\] The goal is to specify the lower tail quantile \(\rho_0\) and probability \(p_{\rho}\) for the range parameter and the upper tail quantile \(\sigma_0\) and probability \(p_{\sigma}\) for the marginal standard deviation. We choose these 4 parameters to help ensure that the field does not become to ‘complex’; specifically we want to avoid the field becoming too ‘wiggly’ or ‘large’. The PC prior shrinks the spatial field towards the base model of no spatial variance and variance zero, hence the name ‘Penalized Complexity’.
We now define our model, with \(\rho_0\), \(p_{\rho}\), \(\sigma_0\), and \(p_{\sigma}\) fixed at 35km, 0.01, 2, and 0.01 respectively.
# First define the SPDE model
matern_land <- inla.spde2.pcmatern(mesh_land,
prior.sigma = c(2, 0.01),
prior.range = c(35, 0.01))
Next, we must now define our detection probability function. First, we define a half-normal detection probability function. This object will take the form of a base R function.
# Define a half-normal detection probability function (code from inlabru tutorials)
log_hn = function(distance, lsig){
-0.5*(distance/exp(lsig))^2
}
In addition, it is often advised in distance sampling applications to threshold our upper detection distances at a ‘sensible’ value. Let’s plot our observed distances from the trackline.
hist(Sightings_survey$DISTANCE)
It looks like 2km could be a reasonable threshold value to choose. Let’s set all values above 2km equal to 2km and scale the distances onto the km units scale. This rescaling to units of km is crucial for numerical stability. We define a variable ‘distance’ that contains the rescaled distances. Note that a few distances are missing. We impute these with the mean distance.
Sightings_survey$DISTANCE[Sightings_survey$DISTANCE>2000 & !is.na(Sightings_survey$DISTANCE)] <-
2000
# needs renaming to match the formula argument
Sightings_survey$distance <- Sightings_survey$DISTANCE
# There are a couple of missing distances. Impute these with the mean
Sightings_survey$distance[is.na(Sightings_survey$distance)] <- mean(Sightings_survey$distance,na.rm=T)
Sightings_survey <- Sightings_survey[,-c(8)]
Sightings_survey$distance <- Sightings_survey$distance / 1000
Next, we need to describe how inlabru should combine the components of the model: the Gaussian random field (GRF), the detection function parameter lsig, and the intercept.
We will give the GRF the name ‘mySpatial’. This name is arbitrary and can be changed. To define ‘mySpatial’ as a GRF, we will point it to the model object we defined earlier (matern_land) and will tell inlabru that the coordinates of the points are the locations at which we wish to evaluate the field at. coordinates is a special variable name reserved for the spatial coordinates of a SpatialPointsDataFrame object. The sightings data we will be using are all SpatialPointsDataFrame objects.
The scalar terms Intercept and the detection function parameter lsig are given the special arguments 1. This tells inlabru to create a scalar random variable. These will be available for every data point.
# Define the components of the model
cmp_land1 <- ~ mySpatial(main = coordinates, model = matern_land) +
lsig(1) + Intercept(1)
Note that the components do not define the formula of the model. We must define the formula separately. The formula describes how these components are combined to form the linear predictor. For our log-Gaussian Cox process, the linear predictor will be the model for log \(\lambda_{obs}\).
We now define the formula. On the left hand side of the formula are the response variables. We have two for a distance sampling dataset. The first, are the collections of locations \(Y\). The second are the recorded perpendicular distances from the aircraft.
On the right hand side of the formula, we have the GRF mySpatial. Next, we include the scalar parameter lsig. Unlike mySpatial, we include lsig in the linear predictor as an argument to the detection probability function log_hn. Notice that the log_hn function is evaluated at the response variable distance. Finally, we also include the variable Intercept.
Additionally, the offset of log(2) is required due to the unknown direction of the detections (detections can occur either side of the aircraft). For now, we ignore covariates.
formula_1 = coordinates + distance ~ mySpatial +
log_hn(distance, lsig) +
Intercept + log(2)
For clarity, the linear predictor defined in formula_1 is:
\[\textrm{log}(\lambda_{true}(s)) = \beta_0 + Z(s) \\ \textrm{log}(p(d)) = \textrm{log}(2) + \frac{-d^2}{2\sigma^2}\].
Thus, the expected encounter rate of an animal at location \(s\), a distance \(d\) away from the aircraft is:
\[\lambda_{obs}(s) = \lambda_{true}(s) p(d)\].
Note that \(Z(s)\) is the value of the GRF mySpatial evaluated at position \(s\), and \(\sigma\) is the exponential of lsig.
Defining the components and the formula separately in two stages is what gives inlabru its generalisibility. Parameters can be included within a model non-linearly, unlike most other R packages! For example, lsig was included nonlinearly through our custom function log_hn! Behind the scenes, for nonlinear models, inlabru linearises the model via a Taylor series expansion.
Furthermore, components and formulae can be stacked together for datasets from differing sources, allowing for integrative joint models to be fit with relative ease! For example, the same components can be shared across different model, but take different forms for each (e.g. Intercept in model 1, Intercept^2 in model 2, etc.,)!
For detailed additional information on the types of model components allowed by inlabru and the linearisation procedure performed, call ?component and here.
Before fitting the model, we set the following global options for inlabru.
bru_options_set(
bru_verbose = TRUE,
verbose = TRUE,
control.inla=list(int.strategy='eb'),
num.threads=2
)
We want both INLA and inlabru to display all messages about the convergence, because these messages can help with diagnosing any issues with the model fitting. To do so, we set verbose arguments to TRUE.
For the purposes of this workshop, we also tell INLA to perform an empirical Bayes analysis (control.inla=list(int.strategy='eb')) instead of a full Bayesian analysis. This speeds up computation dramatically, but fails to propagate the full uncertainties associated with the hyperparameters into inference. This is because these empirical Bayes analyses condition on the single ‘most-likely’ model defined by the posterior modes of the hyperparameters. In contrast, a fully-Bayesian analysis performs Bayesian model averaging by averaging across (possibly infinitely many) models defined by all possible combinations of hyperparameter values weighted by their prior probabilities. The discrepancies between these two strategies can be large and in a publication, int.strategy should be set to auto (the default setting).
The number of threads (similar to cores) can be chosen. The higher the number, the faster the inference due to parallelisation. To see how many threads are available for use, run detectCores(). Always save at least 1 core for running background tasks!
We are ready! We fit our first log-Gaussian Cox process model in inlabru!
We will fit the model using the like and bru approach because it allows to easily combine different data sources or data streams into an integrative joint models.
First, we define our likelihood object using the function like. We specify that the SpatialPointsDataFrame Sightings_survey is our data. Remember - it is crucial that the data are of type SpatialPointsDataFrame for the special coordinates variable to work. Next, we specify the domains for both our response variables coordinates and distance. We point coordinates to the spatial mesh mesh_land and point the distance variable to a 1D mesh that we define. Remember that we have set the upper limit to the distance equal to 2km.
Next, we point the likelihood to the formula object we created earlier that defines the linear predictor and specify that a Cox process model is being fitted by specifying family='cp'. Note that a HUGE library of likelihoods are available. Calling inla.list.models() provides a list of every possible model, prior and latent effect available. Finally, we provide a sensible (negative) initial value for lsig.
fit = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_1,
family = 'cp',
options = list(bru_initial = list(lsig = -1)))
Next, we call the function bru. This function combines all the likelihood objects together with the specified components and performs model inference. Note that we only have a single likelihood object fit, with components cmp_land1.
Running this function may take 2-5 min.
# Warning: can be slow!
fit <- bru(fit, components = cmp_land1)
Is this a good fit? One simple way of gauging how well a model fits the data is to assess it’s DIC value. DIC is similar to AIC/BIC in that it trades of goodness-of-fit with a penalty for the complexity of the model under the Occam’s Razor principle of ‘simple explanations trump complicated ones when both predict outcomes equally well’. Whilst a lone DIC value is somewhat meaningless, we can compare the DIC values of competing models fit to the same data for comparison. The lower the DIC score, the better the model fit.
fit$dic$dic #DIC is 840. This is our benchmark
## [1] 840.0253
Let’s turn off the (slightly overwhelming) messages from INLA for the remainder of the workshop.
bru_options_set(
bru_verbose = TRUE,
verbose = FALSE,
control.inla=list(int.strategy='eb'),
num.threads=2
)
inlabru model.Now that we have fit our log-Gaussian Cox process model, what can we learn? What can we say about the whale intensity?
inlabru provides many user-friendly functions for answering these questions. Three such functions are spde.posterior, predict and pixels. Let’s see them in action.
As a first objective, let’s plot the predicted whale intensity across the domain. This plot can help to visually assess where the model predicts ‘hotspots’ or areas of high whale activity exist. The following workflow can help to make professional plots easily:
pixels.# Define pixels for plotting
plot_pixels <- pixels(mesh_land, mask = Domain)
# mask = Domain ensures we do not predict in our 'buffered' regions, but only the original Domain
predict. The package inlabru (and INLA) allows for approximate Monte Carlo samples of each random component to be generated from the posterior distributions of the model. The function predict then computes the user-specified function for each sample before taking the empirical mean, standard deviation, quantiles, etc., of these values. This approach allows the posterior distribution of arbitrarily complex nonlinear functions of parameters to be investigated.For the purposes of this workshop, we are going to use only 20 Monte Carlo samples. In practice, to reduce Monte Carlo error, this number may need to be >1000. The Monte Carlo standard error should always be reported. We set the seed so that the results are reproducible.
# Set random seed
seed <- c(12345)
# Predict the intensity defined by the function exp(mySpatial + Intercept) across the pixels
pred_int <- predict(fit, plot_pixels, ~ exp(mySpatial + Intercept), n.samples = 20, seed=seed)
# What summary statistics have been computed for the intensity?
names(pred_int)
## [1] "mean" "sd" "q0.025" "median" "q0.975" "smin" "smax" "cv"
## [9] "var"
Notice how lots of useful summary statistics are computed by default using the predict function!
gg. We will plot both the posterior mean (representing the predicted whale intensity) and the posterior standard deviation (representing the uncertainty around the prediction).# Define a 'pretty' colour palette
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
limits = range(...))
}
# Plot the posterior mean intensity
ggplot() + gg(pred_int[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int@data$mean)
# Plot the posterior SD (note the index [2])
ggplot() + gg(pred_int[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int@data$sd)
From these plots, we see 3 interesting features. First, a very low whale intensity is predicted in the regions frequently visited by the whale-watching companies. Second, hotspots are identified in the Western region that had lots of survey effort focused in it, and in the center of the study region.
Finally, evidence of a strong boundary effect in the South Eastern corner is seen. A very large mean and standard deviation is seen, indicating that some numerical instability may have occurred in this region. This can occur when the mesh becomes ‘ugly’, with tight triangles trying to recreate a jagged boundary. Whilst the large standard deviation implies that this should not impact estimates of posterior mean population size, it may impact any future estimates of uncertainty. One future solution could be to improve the mesh in this corner. An alternative conclusion could be that we are fitting a model that is too complicated given the (very) limited size of the dataset.
Next, let’s look at the posterior distributions of the spatial range and the marginal variance of the random field. Remember: the spatial range (in km) is estimating the ‘approximate distance required between two points for the values of the intensity their to be approximately independent’. For this, we use the functions spde.posterior and plot.
spde.range <- spde.posterior(fit, "mySpatial", what = "range")
plot(spde.range)
spde.var <- spde.posterior(fit, "mySpatial", what = "variance")
plot(spde.var)
We can see that the model is estimating a spatial field with a reasonably large spatial range and small variance. No evidence of severe overfitting is shown (a very large variance with a tiny spatial range can be suspicious).
Next, we can plot the estimated detection probability function, with uncertainty intervals. For this, we use predict once again, with one slight modification. Instead of using pixels to define a set of SpatialPixels to plot over, we must manually specify a data.frame object with the desired values of distance to predict across. Then, we can use the function plot as before. Note: I have had some issues with using the predict function on non-spatial data.frames. If you get an error message, see Additional tips.
distdf <- data.frame(distance = seq(0, 2, length = 100))
dfun <- predict(fit, distdf, ~ exp(log_hn(distance, lsig)), n.samples = 20, seed=seed)
plot(dfun)
This shows the probability of detecting a whale (given that it is at the ‘surface’), as a function of distance from the vessel. The posterior mean and 95% credible intervals are shown. Note that for a publication, n.samples may need to be much higher. To estimate the Monte Carlo standard error at a distance 1, we can do the following:
dfun$sd[50]/sqrt(20) #sqrt(20) by central limit theorem
## [1] 0.02228078
This is huge!
inlabru provides functions for predicting the population size! That is, we can investigate the posterior distribution for the expected number of whales with the following caveats:
We are assuming that every encounter made was with a single individual
We are assuming that all individuals can be spotted at distance 0m!
We have not factored in availability bias (perhaps whales are diving and missed)
We have not factored in the influence of group size on detection probability!
We are assuming sufficient mixing time allowed between encounters
Ignoring these issues for now, we predict the total population size falling within the Domain. To achieve this, we use the function ipoints. This function sets a series of integration points with weights (areas), to allow us to approximate the integral of the intensity surface across the domain with a Riemann sum approximation. Basically, it approximates: \[\int_\Omega \mathbf{E}_Z \left( \mathbf{E}( \lambda_{true}(s) | Z) \right) ds\]
with \[\sum_{A_i} \mathbf{E}_Z \left( \mathbf{E}( \hat{\bar{\lambda}}_{true}(s_i) | Z) |A_i| \right) ds\]
Note that ipoints creates a SpatialPointsDataFrame object with a single named variable weight containing the area of each region corresponding to each integration point.
predpts <- ipoints(Domain, mesh_land)
head(predpts$weight) # these are the areas of the regions corresponding each point.
## [1] 42.58444 87.53765 162.51207 132.87393 12.59241 107.60306
Then, we use predict, with the user-specified function equal to a weighted sum of the intensity. Notice that we have omitted the detection probability function log_hn. Thus, we are predicting across \(\Omega\), assuming perfect detectability.
# Define the weighted sum function
Lambda <- predict(fit, predpts, ~ sum(weight * exp(mySpatial + Intercept)), n.samples = 20, seed=seed)
Lambda
## mean sd q0.025 median q0.975 smin smax cv
## 1 649.928 111.7442 501.8101 612.5868 872.4116 472.8683 896.6082 0.1719331
## var
## 1 12486.76
The population estimated is approximately 650 individuals with a large variance of approximately 1.210^{4}.
There remains a potential problem with our inference. If we look back at the plot of our domain, we notice that it extends in the South-Western direction far beyond our survey tracklines. Thus, there is a risk of extrapolation error! Ideally, we should restrict our predictions to lie ‘close’ to our effort!
Part A. Plot a map of the whale intensity, but colour grey all pixels that lie over 30km away from the nearest survey trackline.
Hint 1: To extract a subset of a SpatialPoints object X that lies within a 30km distance of another Spatial object Y, we run the following code:
X_subset < X[which(apply(gWithinDistance(X,Y,dist = 30,byid = T),2,sum)>0),]
Hint 2: We will need to convert the SpatialPixels intensity object into a SpatialPoints before subsetting. To convert a SpatialPixels object A to a SpatialPoints object B run the following:
B <- as(A, 'SpatialPointsDataFrame')
If you get really stuck, feel free to show the answer code!
# Convert the SpatialPixels into a SpatialPoints object before subsetting
pred_int_points <- as(pred_int,'SpatialPoints')
# Subset to 30 km from Effort lines
pred_int_restricted <-
pred_int[which(apply(gWithinDistance(pred_int_points, Effort_survey,dist = 30,byid = T),2,sum)>0),]
# Plot the posterior mean
ggplot() + gg(pred_int_restricted[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int_restricted@data$mean)
# Plot the standard deviation
ggplot() + gg(pred_int_restricted[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int_restricted@data$sd)
Part B. Estimate the total number of whales in the region lying within 30km of the nearest trackline.
If you get really stuck, feel free to show the answer code!
# Subsetting the ipoints that contain the weights (area)
predpts_restricted <- predpts[
which(apply(gWithinDistance(predpts,Effort_survey,dist = 30,byid = T),2,sum)>0),]
# Using predict to
Lambda_restricted <- predict(fit, predpts_restricted, ~ sum(weight * exp(mySpatial + Intercept)), n.samples = 20, seed=seed)
Lambda_restricted
## mean sd q0.025 median q0.975 smin smax cv
## 1 491.2889 76.10782 399.2922 475.9487 655.67 396.3423 666.7528 0.1549146
## var
## 1 5792.401
You should get a smaller population estimate of approximately 500 individuals and a smaller, but still large, variance of approximately 6000. Don’t worry if your answers are slightly different - we’ve only used 20 samples!
inlabru modelAdding environmental covariates to models is made easy in inlabru. All that is required is that each covariate is converted to class SpatialPixelsDataFrame and that the covariate data are complete. By ‘complete’, we mean that a non-missing value is available at every point within the region defined by the mesh boundary.
Let’s add the log_Depth covariate. To do this, we need to redefine the components and formula. First we define the updated components.
cmp_land2 <- ~ mySpatial(main = coordinates, model = matern_land) +
Intercept(1) + log_Depth(main = log_Depth, model='linear') +
lsig(1)
Notice that we have defined a new variable called log_Depth. Using the argument main, we point the variable to the SpatialPixelsDataFrame object log_Depth. Finally, we tell inlabru that we are fitting a linear effect.
Next, we redefine our new formula object:
formula_2 = coordinates + distance ~ mySpatial +
Intercept + log_Depth +
log_hn(distance, lsig) +
log(2)
The only change here, is the addition of the linear effect log_Depth to the linear predictor. For clarity, the linear predictor defined in formula_2 is:
\[\textrm{log}(\lambda_{true}(s)) = \beta_0 + \beta_1 X(s) + Z(s) \\ \textrm{log}(p(d)) = \textrm{log}(2) + \frac{-d^2}{2\sigma^2}\].
Thus, the expected encounter rate of an animal at location \(s\), a distance \(d\) away from the aircraft is:
\[\lambda_{obs}(s) = \lambda_{true}(s) p(d)\].
Note that \(X(s)\) is the value of log_Depth at position \(s\), \(Z(s)\) is the value of the GRF mySpatial evaluated at position \(s\), and \(\sigma\) is the exponential of lsig.
Now we can fit the model using the like/bru approach:
Step 1, define the likelihood
fit2 = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_2,
family = 'cp')
Step 2, fit the model with bru.
This will take 2-5 min to run.
# Slow to run bru
fit2 <- bru(fit2, components = cmp_land2)
## iinla: Iteration 1 [max:10]
## iinla: Iteration 2 [max:10]
## iinla: Max deviation from previous: 5.14% of SD [stop if: <1%]
## iinla: Iteration 3 [max:10]
## iinla: Max deviation from previous: 2.27% of SD [stop if: <1%]
## iinla: Iteration 4 [max:10]
## iinla: Max deviation from previous: 2.41% of SD [stop if: <1%]
## iinla: Iteration 5 [max:10]
## iinla: Max deviation from previous: 2.69% of SD [stop if: <1%]
## iinla: Iteration 6 [max:10]
## iinla: Max deviation from previous: 2.01% of SD [stop if: <1%]
## iinla: Iteration 7 [max:10]
## iinla: Max deviation from previous: 0.897% of SD [stop if: <1%]
## iinla: Convergence criterion met, running final INLA integration with known theta mode.
## iinla: Iteration 8 [max:10]
We can then explore the results.
fit2$dic$dic # DIC is 840 again - no improvement
## [1] 840.0886
summary(fit2)
## inlabru version: 2.2.4.9000
## INLA version: 21.01.08-1
## Components:
## mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## log_Depth: Model types main='linear', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ mySpatial + Intercept + log_Depth +
## log_hn(distance, lsig) + log(2)
## Time used:
## Pre = 8.05, Running = 6.84, Post = 1.07, Total = 16
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -5.821 0.849 -7.537 -5.804 -4.201 -5.770 0
## log_Depth -0.196 0.151 -0.487 -0.198 0.106 -0.202 0
## lsig 0.063 0.152 -0.250 0.068 0.346 0.078 0
##
## Random effects:
## Name Model
## mySpatial SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Range for mySpatial 244.722 118.779 97.861 217.155 550.11 174.68
## Stdev for mySpatial 0.896 0.241 0.511 0.866 1.45 0.81
##
## Expected number of effective parameters(stdev): 23.34(0.00)
## Number of equivalent replicates : 511.74
##
## Deviance Information Criterion (DIC) ...............: 840.09
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 21.42
##
## Watanabe-Akaike information criterion (WAIC) ...: 835.84
## Effective number of parameters .................: 16.60
##
## Marginal log-Likelihood: -441.67
## Posterior marginals for the linear predictor and
## the fitted values are computed
So, we find that adding log_Depth to the model fails to improve the model fit, as judged by DIC. Furthermore, inspecting the output from summary(), we see that the 95% credible intervals cover zero. Thus, we fail to detect a significant effect of log_Depth.
Fit a model, with log_Slope included. Does this improve the model fit? Are either of the variables found to be significantly associated with the whale intensity?
Hint: Remember that there are multiple steps.
Part A. Define the model.
cmp_land3 <- ~ mySpatial(main = coordinates, model = matern_land) +
Intercept(1) + log_Slope(main = log_Slope, model='linear') +
lsig(1)
formula_3 = coordinates + distance ~ mySpatial +
Intercept + log_Slope +
log_hn(distance, lsig) +
log(2)
Part B. Define the likelihood with like.
fit3 = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_3,
family = 'cp')
Part C. Fit the model with bru.
fit3 <- bru(fit3, components = cmp_land3)
## iinla: Iteration 1 [max:10]
## iinla: Iteration 2 [max:10]
## iinla: Max deviation from previous: 7.42% of SD [stop if: <1%]
## iinla: Iteration 3 [max:10]
## iinla: Max deviation from previous: 6.47% of SD [stop if: <1%]
## iinla: Iteration 4 [max:10]
## iinla: Max deviation from previous: 6.46% of SD [stop if: <1%]
## iinla: Iteration 5 [max:10]
## iinla: Max deviation from previous: 4.57% of SD [stop if: <1%]
## iinla: Iteration 6 [max:10]
## iinla: Max deviation from previous: 4.84% of SD [stop if: <1%]
## iinla: Iteration 7 [max:10]
## iinla: Max deviation from previous: 2.08% of SD [stop if: <1%]
## iinla: Iteration 8 [max:10]
## iinla: Max deviation from previous: 1.33% of SD [stop if: <1%]
## iinla: Iteration 9 [max:10]
## iinla: Max deviation from previous: 3.08% of SD [stop if: <1%]
## iinla: Maximum iterations reached, running final INLA integration.
## iinla: Iteration 10 [max:10]
fit3$dic$dic # DIC is 841.6 - no improvement
## [1] 841.5942
summary(fit3)
## inlabru version: 2.2.4.9000
## INLA version: 21.01.08-1
## Components:
## mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## log_Slope: Model types main='linear', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ mySpatial + Intercept + log_Slope +
## log_hn(distance, lsig) + log(2)
## Time used:
## Pre = 8.05, Running = 9.18, Post = 1.05, Total = 18.3
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -6.795 0.393 -7.571 -6.793 -6.029 -6.789 0
## log_Slope -0.021 0.295 -0.630 -0.010 0.527 0.011 0
## lsig 0.062 0.152 -0.250 0.067 0.346 0.077 0
##
## Random effects:
## Name Model
## mySpatial SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Range for mySpatial 197.175 87.714 84.383 177.987 420.39 147.171
## Stdev for mySpatial 0.838 0.202 0.498 0.821 1.28 0.789
##
## Expected number of effective parameters(stdev): 25.28(0.00)
## Number of equivalent replicates : 472.36
##
## Deviance Information Criterion (DIC) ...............: 841.59
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 22.97
##
## Watanabe-Akaike information criterion (WAIC) ...: 837.71
## Effective number of parameters .................: 18.41
##
## Marginal log-Likelihood: -441.89
## Posterior marginals for the linear predictor and
## the fitted values are computed
Try adding both log_Slope and log_Depth variables. Does this improve the fit?
Part A. Define the model.
cmp_land4 <- ~ mySpatial(main = coordinates, model = matern_land) +
Intercept(1) + log_Depth(main = log_Depth, model='linear') +
log_Slope(main = log_Slope, model = 'linear') +
lsig(1)
formula_4 = coordinates + distance ~ mySpatial +
Intercept + log_Depth +
log_Slope +
log_hn(distance, lsig) +
log(2)
Part B. Define the likelihood with like.
fit4 = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_4,
family = 'cp')
Part C. Fit the model with bru.
fit4 <- bru(fit4, components = cmp_land4)
## iinla: Iteration 1 [max:10]
## iinla: Iteration 2 [max:10]
## iinla: Max deviation from previous: 6.14% of SD [stop if: <1%]
## iinla: Iteration 3 [max:10]
## iinla: Max deviation from previous: 4.06% of SD [stop if: <1%]
## iinla: Iteration 4 [max:10]
## iinla: Max deviation from previous: 6.01% of SD [stop if: <1%]
## iinla: Iteration 5 [max:10]
## iinla: Max deviation from previous: 1.52% of SD [stop if: <1%]
## iinla: Iteration 6 [max:10]
## iinla: Max deviation from previous: 2.97% of SD [stop if: <1%]
## iinla: Iteration 7 [max:10]
## iinla: Max deviation from previous: 1.71% of SD [stop if: <1%]
## iinla: Iteration 8 [max:10]
## iinla: Max deviation from previous: 1.68% of SD [stop if: <1%]
## iinla: Iteration 9 [max:10]
## iinla: Max deviation from previous: 3.18% of SD [stop if: <1%]
## iinla: Maximum iterations reached, running final INLA integration.
## iinla: Iteration 10 [max:10]
fit4$dic$dic # DIC is 840 - no improvement
## [1] 840.8905
summary(fit4)
## inlabru version: 2.2.4.9000
## INLA version: 21.01.08-1
## Components:
## mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## log_Depth: Model types main='linear', group='exchangeable', replicate='iid'
## log_Slope: Model types main='linear', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ mySpatial + Intercept + log_Depth +
## log_Slope + log_hn(distance, lsig) + log(2)
## Time used:
## Pre = 8.33, Running = 9.38, Post = 1.21, Total = 18.9
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -5.767 0.841 -7.501 -5.738 -4.199 -5.681 0
## log_Depth -0.239 0.164 -0.541 -0.247 0.104 -0.262 0
## log_Slope 0.185 0.332 -0.505 0.198 0.801 0.224 0
## lsig 0.063 0.152 -0.250 0.068 0.346 0.078 0
##
## Random effects:
## Name Model
## mySpatial SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Range for mySpatial 244.396 120.392 99.01 215.321 557.72 171.715
## Stdev for mySpatial 0.942 0.248 0.56 0.907 1.52 0.839
##
## Expected number of effective parameters(stdev): 25.28(0.00)
## Number of equivalent replicates : 472.51
##
## Deviance Information Criterion (DIC) ...............: 840.89
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 22.92
##
## Watanabe-Akaike information criterion (WAIC) ...: 836.97
## Effective number of parameters .................: 18.32
##
## Marginal log-Likelihood: -445.96
## Posterior marginals for the linear predictor and
## the fitted values are computed
The DIC results indicate that there is no improvement, suggesting that depth and slope are not important covariates.
If you’re interested, try fitting a model with log_Depth, log_Slope and log_Depth_sq all included. Notice that severe model instabilities are encountered. These can be seen by looking at the verbose output from inlabru. Notice how the deviation % does not converge towards zero. If you have extra time, notice how this instability leads to WILD predictions of population size! Fortunately you should notice that this numerical instability is detected with a higher DIC - phew!
So, we ultimately find that the model without covariates (fit) offers the best model fit to the survey dataset. The failure to detect a covariate effect should not come as a major surprise. The survey datasets are quite limited in size, with only 63 sightings in total! In fact, a LGCP model should be used with extreme caution with such a small dataset.
That being said, the DIC values were almost equivalent for fit and fit2 - the model which included log_Depth. Let’s see how the predictions and parameter estimates change under the second model. Are the conclusions drawn from each model consistent?
plot_pixels2 <- pixels(mesh_land, mask = Domain)
pred_int2 <- predict(fit2, plot_pixels2,
~ exp(mySpatial + Intercept + log_Depth), n.samples = 20, seed=seed)
multiplot(ggplot() + gg(pred_int2[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$mean,pred_int@data$mean)) + ggtitle('Covariate Model Mean'),
ggplot() + gg(pred_int2[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$sd,pred_int@data$sd)) + ggtitle('Covariate Model SD'),
ggplot() + gg(pred_int[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$mean,pred_int@data$mean)) + ggtitle('Covariate-Free Model Mean'),
ggplot() + gg(pred_int[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$sd,pred_int@data$sd)) + ggtitle('Covariate-Free Model SD'),
layout = matrix(c(1:4),nrow=2,ncol=2,byrow = F))
# Look at parameters of spde model
spde.range2 <- spde.posterior(fit2, "mySpatial", what = "range")
spde.var2 <- spde.posterior(fit2, "mySpatial", what = "variance")
multiplot(plot(spde.range2)+ylim(range(spde.range$pdf)),plot(spde.range)+xlim(range(spde.range2$range)),
plot(spde.var2)+ylim(range(spde.var$pdf)),plot(spde.var)+xlim(range(spde.var2$variance)),
layout = matrix(1:4,2,2,byrow = F))
Next, we use predict and plot to display the estimated effect of log_Depth on the intensity. Note that the effect is nonlinear on the intensity scale, as the covariate is specified to be linear on the log intensity scale.
# plot the covariate effects
depthdf <- data.frame(log_Depth = seq(from=min(log_Depth@data$log_Depth),
to=max(log_Depth@data$log_Depth),
length.out = 100))
depthpred <- predict(fit2, depthdf, ~ exp(log_Depth), n.samples=40, seed=seed)
plot(depthpred)
# Manually if the predict function breaks
#samples <- generate(fit2, n.samples = 20,, seed=seed)
# depthpred <- sapply(samples,FUN = function(x){return(depthdf$log_Depth*x$log_Depth)})
# ggplot(data.frame(mean=apply(depthpred,1, mean),
# LCL=apply(depthpred,1, quantile, probs=c(0.025)),
# UCL=apply(depthpred,1, quantile, probs=c(0.975)),
# logdepth=seq(from=min(log_Depth@data$log_Depth),
# to=max(log_Depth@data$log_Depth),
# length.out = 100)),
# aes(y=mean,x=logdepth,ymax=UCL,ymin=LCL)) +
# geom_line() + geom_ribbon(alpha=0.4)
Note that the credible intervals cover a huge range of effect sizes! We really fail to detect an effect of log_Depth from the survey data alone, and including it as a covariate may lead to predictions and inference with lower precision!
Finally, we plot the estimated population size across both the entire domain and the restricted domain (restricted to points lying within 30km of the nearest survey trackline). We then plot the estimated detection function. Notice that the predict function fails to work for the variable distance! This is a bug in inlabru that will hopefully be corrected for soon. In the meantime, we manually extract the results by sampling all the components using the function generate and then manually combining them using sapply. Note that the variables LCL and UCL define a symmetric 95% credible interval.
# Plot detection probability function
distdf <- data.frame(distance = seq(0, 2, length = 100))
#dfun2 <- predict(fit2, distdf, ~exp(log_hn(distance,lsig)),n.samples = 20)
# If this calls an error run below
samples <- generate(fit2, n.samples = 20, seed=seed)
dfun2 <- sapply(samples,FUN = function(x){return(exp(log_hn(distdf$distance,x$lsig)))})
ggplot(data.frame(mean=apply(dfun2,1, mean),
LCL=apply(dfun2,1, quantile, probs=c(0.025)),
UCL=apply(dfun2,1, quantile, probs=c(0.975)),
distance=distdf$distance),
aes(y=mean,x=distance,ymax=UCL,ymin=LCL)) +
geom_line() + geom_ribbon(alpha=0.4)
# We can look at the posterior for expected number of whales:
predpts <- ipoints(Domain, mesh_land)
# It looks like we do not need to do this!
Lambda2 <- predict(fit2, predpts, ~ sum(weight * exp(mySpatial + Intercept +
log_Depth)),
n.samples=20, seed=seed)
Lambda2
## mean sd q0.025 median q0.975 smin smax cv
## 1 635.8464 196.2485 357.0863 614.3456 1040.251 311.5598 1226.739 0.3086413
## var
## 1 38513.47
Lambda
## mean sd q0.025 median q0.975 smin smax cv
## 1 649.928 111.7442 501.8101 612.5868 872.4116 472.8683 896.6082 0.1719331
## var
## 1 12486.76
# Avoid extrapolation
Lambda2_restricted <- predict(fit2, predpts_restricted,
~ sum(weight * exp(mySpatial + Intercept +
log_Depth)),
n.samples=20, seed=seed)
Lambda2_restricted
## mean sd q0.025 median q0.975 smin smax cv
## 1 470.7397 94.30338 298.6168 469.3901 602.3382 258.4846 610.2239 0.2003302
## var
## 1 8893.128
Lambda_restricted
## mean sd q0.025 median q0.975 smin smax cv
## 1 491.2889 76.10782 399.2922 475.9487 655.67 396.3423 666.7528 0.1549146
## var
## 1 5792.401
We indeed see very similar estimates for the population size, with overlapping credible intervals. Interestingly, it appears estimates from the model with log_Depth (fit2) have a much greater uncertainty! Remember that in a publication we would draw far more than 20 samples!
Try fitting a model with sea surface temperature (SST) included. If you are interested, below is the code used to download and reformat the SST data from the ERDDAP database. If not, skip ahead to read in the pre-compiled SST object from file.
library(rerddap)
library(tidyverse)
rerddap::info('erdMH1sstdmday')
# what are the lat/lon ranges of the Domain
spTransform(Domain_smoothed,CRS('+proj=longlat +datum=WGS84 +no_defs'))@bbox
# what are the date ranges
range(Sightings_survey@data$DATE_LO)
SST <- rerddap::griddap(rerddap::info('erdMH1sstdmday'),
time=c("2007-06-01","2009-08-31"),
latitude=c(40.1898,47.44040),
longitude=c(-67.7430,-55.08343))
unique(SST$data$time) # Monthly data
# subset to the months we want
SST <- SST$data[which(month(SST$data$time) %in% c(6,7,8)),]
# average over the months - we are estimating a 'summer average' intensity
SST$latlonfac <- as.factor(paste0(SST$lat,SST$lon))
#SST <- as_tibble(SST)
SST <-
SST %>%
select(lat,lon,latlonfac,sst) %>%
group_by(latlonfac) %>%
mutate(sst_avg = mean(sst, na.rm = TRUE)) %>%
distinct(latlonfac,.keep_all=T) %>%
select(lat,lon,sst_avg)
SST_sp <- SpatialPixelsDataFrame(
points=cbind(SST$lon,SST$lat),
data=data.frame(SST$sst_avg),
proj4string = CRS('+proj=longlat +datum=WGS84 +no_defs'),
tolerance = 0.001
)
# reduce the resolution by a factor of 2 and then project onto the correct CRS
SST_sp <- as(projectRaster(raster(SST_sp),crs=Can_proj),'SpatialPixelsDataFrame')
ggplot() + gg(SST_sp) + gg(Domain)
# extend the covariate by nearest neighbour imputation
pixels_Domain <- pixels( mesh_land,mask=gSimplify(gBuffer(Domain,width=19),tol=3),
nx=400,ny=400)
pixels_Domain$SST <- over(pixels_Domain,SST_sp)$SST.sst_avg
# impute missing values with the nearest neighbour value
pixels_Domain$SST[is.na(pixels_Domain$SST)] <-
SST_sp$SST.sst_avg[nncross(as(SpatialPoints(pixels_Domain@coords[which(is.na(pixels_Domain$SST)),]),'ppp'),
as(SpatialPoints(SST_sp@coords),'ppp'),
what = 'which')]
SST_sp <- pixels_Domain
ggplot() + gg(SST_sp) + gg(Domain)
saveRDS(SST_sp, './Data/Covariates/SST_layer.rds')
SST_sp <- readRDS('./Data/Covariates/SST_layer.rds')
So, with SST now loaded, do we find a significant association between whale intensity and SST? Does it improve the model fit?
cmp_land_extra <- ~ mySpatial(main = coordinates, model = matern_land) +
Intercept(1) + SST(main = SST_sp, model='linear') +
lsig(1)
formula_extra = coordinates + distance ~ mySpatial +
Intercept + SST +
log_hn(distance, lsig) +
log(2)
fitextra = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_extra,
family = 'cp')
fitextra <- bru(fitextra, components = cmp_land_extra)
## iinla: Iteration 1 [max:10]
## iinla: Iteration 2 [max:10]
## iinla: Max deviation from previous: 5.11% of SD [stop if: <1%]
## iinla: Iteration 3 [max:10]
## iinla: Max deviation from previous: 7.45% of SD [stop if: <1%]
## iinla: Iteration 4 [max:10]
## iinla: Max deviation from previous: 4.13% of SD [stop if: <1%]
## iinla: Iteration 5 [max:10]
## iinla: Max deviation from previous: 14.1% of SD [stop if: <1%]
## iinla: Iteration 6 [max:10]
## iinla: Max deviation from previous: 4.89% of SD [stop if: <1%]
## iinla: Iteration 7 [max:10]
## iinla: Max deviation from previous: 8.8% of SD [stop if: <1%]
## iinla: Iteration 8 [max:10]
## iinla: Max deviation from previous: 1.36% of SD [stop if: <1%]
## iinla: Iteration 9 [max:10]
## iinla: Max deviation from previous: 2.55% of SD [stop if: <1%]
## iinla: Maximum iterations reached, running final INLA integration.
## iinla: Iteration 10 [max:10]
fitextra$dic$dic # DIC is 841.7 - no improvement
## [1] 841.6619
summary(fitextra)
## inlabru version: 2.2.4.9000
## INLA version: 21.01.08-1
## Components:
## mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## SST: Model types main='linear', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ mySpatial + Intercept + SST + log_hn(distance,
## lsig) + log(2)
## Time used:
## Pre = 7.22, Running = 9.69, Post = 0.921, Total = 17.8
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -6.960 2.563 -11.808 -7.024 -1.738 -7.154 0
## SST 0.010 0.168 -0.335 0.015 0.326 0.025 0
## lsig 0.063 0.152 -0.250 0.068 0.347 0.078 0
##
## Random effects:
## Name Model
## mySpatial SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Range for mySpatial 124.741 27.934 63.508 126.206 175.454 141.278
## Stdev for mySpatial 0.745 0.087 0.542 0.754 0.886 0.812
##
## Expected number of effective parameters(stdev): 25.86(0.00)
## Number of equivalent replicates : 461.75
##
## Deviance Information Criterion (DIC) ...............: 841.66
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 23.64
##
## Watanabe-Akaike information criterion (WAIC) ...: 836.83
## Effective number of parameters .................: 18.02
##
## Marginal log-Likelihood: -441.22
## Posterior marginals for the linear predictor and
## the fitted values are computed
In this session, we have learned how to fit log-Gaussian Cox process models to survey datasets that followed distance sampling protocols within a Bayesian framework. We have seen how to compare models, how to predict intensity across space, and how to estimate population size.
Coming up next, we will see how to incorporate presence-only datasets into our models. Crucially, we will find that by incorporating presence-only datasets into our models, we may be able to uncover previously hidden species-environment relationships, and use these to improve the resolution of our maps of whale intensity!
An alternative ‘solution’ to the jagged mesh problem is to define the mesh within a much larger convex hull around the coastline. Unfortunately, this approach will not take into account the coastline and hence will measure spatial ‘distance’ as euclidean distance and not ‘ocean-distance’. One such mesh is defined below:
# Compare with a mesh that ignores the land barrier
mesh_noland <- inla.mesh.2d(boundary = Domain_smoothed,
min.angle = 25,
max.edge = c(45,60),
cutoff = 30,
offset = c(10,20))
mesh_noland$crs <- Can_proj
plot(mesh_noland)
mesh_noland$n
## [1] 538
We ignore this mesh for the remainder of this session, since it fails to account for land as a barrier.
predict on non-spatial data.framesI have had some issues with using the predict function on non-spatial data.frames. If you get an error message, try using the generate - sapply method shown below.
samples <- generate(fit, n.samples = 20, seed=seed)
distpred <- sapply(samples,FUN = function(x){return(exp(log_hn(distdf$dist,x$lsig)))})
ggplot(data.frame(probability=apply(distpred,1, mean),
LCL=apply(distpred,1, quantile, probs=c(0.025)),
UCL=apply(distpred,1, quantile, probs=c(0.975)),
distance=distdf$distance),
aes(y=probability,x=distance,ymax=UCL,ymin=LCL)) +
geom_line() + geom_ribbon(alpha=0.4)
Fuglstad, Geir-arne, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2019. “Constructing priors that penalize the Complexity of Gaussian random fields.” Journal of the American Statistical Association 1459 (525): 445–52. https://doi.org/10.1080/01621459.2017.1415907.
Lindgren, Finn, and Håvard Rue. 2011. “An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic.” Journal of the Royal Statistical Society B 73 (4): 423–98.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and Sigrunn H Sørbye. 2017. “Penalising model component complexity: a principled, practical approach to constructing priors.” Statistical Science 32 (1): 1–28. https://doi.org/10.1214/16-STS576.
Alternatively we could use plot(mesh_land)↩︎